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Abstract 


Background: The CCCTCbinding factor (CTCF) has diverse regulatory functions. However, the definitive characteristics 
of the CTCF binding motif required for its functional diversity still remains elusive. 


Results: Here, we describe a new motif discovery workflow by which we have identified three CTCF binding moti 
variations with highly divergent functionalities. 
Supported by transcriptomic, epigenomic and chromatin-interactomic data, we show that the functional diversity of 
the CTCF binding motifs is strongly associated with their GC content, CpG dinucleotide coverage and relative DNA 
methylation level at the 12th position of the motifs. Further analysis suggested that the co-localization of cohesin, the 


high CpG. 


features at the 12th position of the motifs. 


key factor in cohesion of sister chromatids, is negatively correlated with the CoG coverage and the relative DNA 
methylation level at the 12th position. Finally, we present evidences for a hypothetical model in which chromatin 
interactions between promoters and distal regulatory regions are likely mediated by CTCFs binding to sequences with 


Conclusion: These results demonstrate the existence of definitive CTCF binding motifs corresponding to CTCF's 
diverse functions, and that the functional diversity of the motifs is strongly associated with genetic and epigenetic 


Keywords: CTCF, Binding motif, DNA methylation, Chromatin interaction 


Background 

The CCCTC-binding factor (CTCF) is an 11-zinc-finger 
protein that is functionally conserved in vertebrates, 
insects and nematodes [1]. It has been shown to be 
involved in various critical biological processes and has 
long been regarded as a versatile regulator. CTCF knock- 
out mice exhibit early embryonic lethality prior to im- 
plantation [2, 3]. In adult organisms, the CTCF protein 
is ubiquitously expressed across most tissues, and its 
expression levels vary in a cell type-specific manner, 
suggesting a role in maintaining phenotypic diversity [4]. 
Furthermore, CTCF depletion results in dysregulated 
transcription of hundreds of genes in oocytes [5] and 
dramatically deregulates cell-cycle progression during T 
lymphocyte lineage commitment within the thymus [3]. 
Other studies have suggested that CTCF impacts cellular 
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activity by playing diverse roles in gene regulation, 
including promoter activation/repression, genomic im- 
printing, enhancer blocking, and, most recently, long- 
range chromatin interactions [2, 4]. 

Recent genome-wide mapping of CTCF occupancy in 
multiple cell lines has enabled the identification of 
CTCF binding sequences [6-9]. Several hypotheses have 
been proposed to explain the uniquely versatile charac- 
teristics of CTCF based on the underlying binding motif 
sequence. The most widespread is the zinc-finger model, 
in which the capacity of CTCF to confer vastly different 
functions has been attributed to the interplay between 
the zinc-finger engagement and the underlying sequence 
[4]. Early studies in which zinc-fingers were depleted in 
a stepwise manner reported involvement of multiple 
zinc-fingers within a broad ~50 bp sequence [1], indicat- 
ing that CTCF binding may be partially stabilized by 
interplay between peripheral zinc-fingers and other co- 
factors. Following this, a ~11-15 bp core consensus 
sequence was identified as bound by 4-5 central zinc- 
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fingers [10]. A variation at the 12th bp position of the 
consensus sequence was found to be tightly associated 
with the DNA methylation level of the binding site 
which in turn was associated with remodeling of CTCF 
binding in immortalized cells [9]. Several studies have 
suggested that the functional diversity of CTCF is associ- 
ated with its sequence variations [11, 12], but the mech- 
anism by which sequence variation determines its 
function remains unexplored. Consequently, we sought 
to identify and characterize the variation in CTCF bind- 
ing sequences and its relation to the CTCF functional 
spectrum. Using a newly developed motif discovery 
workflow, we were able to distinguish three different 
CTCF binding motifs. Integrative analyses of data on 
transcription factors, histone modifications, chromatin 
conformations and gene expression across multiple cell 
lines suggest distinct functionalities of these three CTCF 
binding sequence variations. In particular, our analysis 
revealed that CpG coverage and methylation status at 
the 12th position of the CTCF binding motifs have a 
marked effect on the colocalization of cohesin, which in 
turn implies that the variations in the CTCF binding se- 
quence mediate different effects on chromatin interac- 
tions. To test this assumption, we examined the effects 
of the three CTCF binding sequence variations in rela- 
tion to chromatin interactions, and found that chroma- 
tin interactions between promoters and distal regulatory 
elements tends to be mediated by CTCFs that bind to 
motifs with higher CpG content. 


Results and discussion 

CTCF binding sequence variations detected by a motif 
discovery workflow 

It has been shown that the functional diversity of CTCF 
may be associated with the occupancy of the protein at 
its binding sites [11, 12]. This led us to the hypothesis 
that the functional diversity of CTCF may result from its 
binding affinity, and thus be influenced by the variations 
in the CTCF binding motifs. To obtain an extensive 
sample of the variations of CTCF binding motifs, we 
revisited the genome-wide data of chromatin immuno- 
precipitation of CTCF followed by high-throughput 
sequencing (ChIP-seq) from the ENCODE Project [13, 14]. 
While many of the CTCF binding sites were observed 
to be bound across all or most cell types, ~20—50 % of 
CTCF sites showed some level of cell type specific 
binding [6, 15]. However, cell type specific CTCF bind- 
ing sites have recently been shown to be less occupied 
than constitutive sites [12], indicating that cell type spe- 
cific binding of CTCF is less stable and weaker than 
constitutive binding, which implies the possibility that 
the detection of cell type specific CTCF binding events 
may have a higher false positive rate. In order to obtain 


Page 2 of 12 


high confidence CTCF binding sites, we extracted 
CTCF ChIP-seq peaks that appeared in all the 12 tested 
cell lines [16-18] (Additional file 1: Table S1) and 
regarded these as “constitutive” binding sites. A total of 
12,761 constitutive CTCF binding sites were detected 
by MACS [19] with a threshold FDR < 0.01. Sequences 
of 200 bp in length centered at the summit of each 
CTCF binding peak were extracted and pooled for 
motif discovery. In the rest of this paper, if not stated 
otherwise, the analysis was performed on the aforemen- 
tioned constitutive CTCF binding sites. 

Different from previous motif generation methods that 
attempt to obtain a maximal description for all pooled 
sequences as a whole [20, 21], our workflow searches for 
multiple motifs, balancing the number of sequences each 
motif represents and the divergent nature of the recogni- 
tion motifs of a given binding protein. In other words, 
our workflow detects motifs that represent the consen- 
sus of mutually exclusive subsets of pooled sequences, 
and the motifs consequently represent sequence subsets 
that could be very different. Our workflow consists of 
five major steps: (1) motif generation; (2) motif evalu- 
ation; (3) sequence elimination; (4) motif updating; and 
(5) a stopping criterion (Fig. 1a, Table 1 and Additional 
file 2). 

We applied this workflow to the CTCF binding se- 
quences described above, and identified three distinct 
CTCF core motifs of high confidence (Fig. 1b, E-values 
< 2.0E-457, < 3.0E-609 and < 2.1E-519:; g-values < 7.0E- 
10, < 1.0E-4 and < 7.0E-4, for the three motifs, respect- 
ively). The details concerning iteration and conver- 
gence of the process are shown in Additional file 3: 
Figure S1. We named the three CTCF binding se- 
quence variations as CTCF-A, CTCF-B and CTCF-C, 
respectively (Additional file 4: Dataset S1). CTCF-A 
motifs constituted the largest fraction of the motifs 
(57 %), while 24 and 14 % of the CTCF binding sites 
contained the CTCF-B and CTCF-C motifs, respect- 
ively (Fig. 1d). The remaining 5 % of the binding sites 
did not fall into any of the three categories above, and 
were excluded from subsequent analyses. At the DNA 
sequence level, the three motifs did not show substan- 
tial divergence (Fig. 1), although the GC content in the 
informative sites of CTCF-A was moderately larger 
than in the CTCF-B and CTCF-C motifs. We then ex- 
tracted the flanking region [-100 bp, +100 bp] centered at 
each CTCF binding motif, and calculated the binding in- 
tensity by counting the number of ChIP-seq reads that 
mapped within the flanking region. The binding affinity 
differed significantly among the three motifs (Additional 
file 5: Figure S2), suggesting three CTCF motifs may con- 
fer distinct functionalities. We therefore sought to investi- 
gate whether there is functional divergence among CTCF 
binding sites containing the three different motifs. 
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Actively expressed genes are predominantly associated 
with CTCF-A 
To investigate the association between gene expres- 
sion activity and the occupancy of CTCF binding 
sites, we used a set of histone modification data from 
the Broad Institute [22] available from the ENCODE 
project (Additional file 6: Table S2), in combination 
with counts of histone marks in the flanking regions 
of each CTCF binding site (Fig. 2, Fig. 3a). We ob- 
served that regions close to CTCF-A binding sites 
were highly enriched for chromatin features that have 
been associated with active regulatory genome ele- 
ments. For example, an enrichment of H3K4me3 and 
H3K9ac is considered a strong indication of an active 
promoter [23, 24], and an enrichment of H3K27ac 
separates active from poised enhancers [25]. In con- 
trast, the vicinity of CTCF-A sites were depleted of 
repressive chromatin marks, such as H3K27me3 [26]. 
We did not observed any significant enrichment of 
particular chromatin features in the vicinity of CTCF- 
B and CTCF-C binding sites. This pattern was even 
more pronounced for tissue specific CTCF binding sites 
(Additional file 7: Figure $3), in that tissue specific CTCF- 
A binding sites showed a higher frequency of active chro- 
matin marks than constitutive CTCF-A binding sites 
(Wilcoxon-Rank-Sum test p-value = 0.037 and 0.012 for 
H3K27ac and H3K4me3). 

These results prompted us to investigate the associ- 
ation between gene expression and the occupancy of 
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Fig. 1 Motif Discovery Workflow and three detected CTCF sequence variations. a Cartoon of the motif discovery workflow (see Methods for details 
and Table 1 for the pseudocode of the algorithm). b Motif Logo for three core CTCF consensus sequences (1, 2, 3 for CTCF-A, CTCF-B and CTCF-C 
motif, respectively). The stars (*) in the first row indicate the most informative sites. c Average GC content at the most informative sites of the three 
CTCF binding sequence variations. d Distribution of three CTCF binding sequence variations 
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CTCF binding sequence variations. We divided all the 
coding genes into four groups based on CTCF occu- 
pancy in the flanking regions of their transcriptional 
start sites (TSS), referred to as CTCF-A-Linked (AL), 
CTCF-B-Linked (BL), CTCF-C-Linked (CL) or ‘other’ 
genes. Excluding ChIP-seq peaks that did not corres- 
pond to any of these three CTCF binding motifs, we 
assigned each of the 9822 genes annotated by ENSEM- 
BLE into a group corresponding to its nearest CTCF 
binding motif within a [-50kbp,+50kbp] region of its 
TSS, or, if none, as ‘other genes’. We examined the tran- 
scriptional activities of the genes of each group in the 
GM12878, K562 and HeLaS3 cell lines based on RNA- 
seq data available from ENCODE [27]. Most of the ex- 
amined genes were transcriptionally silent; however, in 
all three cell lines, we found that the expression levels of 
AL genes were, in general, higher than those of BL (Wil- 
coxon test, p-value = 1.93E-4, 1.66E-2 and 4.96E-4 for 
GM12878, K562 and HeLaS3 cells, respectively) and CL 
genes (Wilcoxon test, p-value =2.12E-3, 1.65E-2 and 
1.45E-2 for GM12878, K562 and HeLaS3 cells, respect- 
ively). Taken together, these results suggest that CTCF-A 
binding sites are more frequently involved in active gene 
transcriptional regulation than the two other types of 
sites. 


Functional diversity of CTCF sequence variations 
As the next step, we studied the overlap between the 
CTCF sites and annotations of regulatory regions. Twelve 
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Table 1 Pseudo-code for the motif discovery workflow. The details for Motif Evaluation, Sequence Elimination and Motif Updating 


can be found in the Additional file 2 


Start 


Set Seq to be the initial CTCF sequences; 
Set m = 1 and similarity = 0; 


Set Seqo = NULL and Seq, = Seq; 


while the size of Seq,, is greater than 10% of Seq; 
Begin 


Set Seqm = Seq — S€Gm-1 


While Similarity < 0.98; 


1. Generate motif, from seqm using MEME; 


2. Evaluate newly generated motif, (Motif Evaluation); 
3. Rescan each sequence in Seq using motif, and assign each 


sequence in Seqnm a q-value; 


4. Discard sequences with q-value > 0.01, and set the left sequences as 


Seq’n (Sequence Elimination); 


5. Set Seq’, as the top 100 sequences of ascendingly sorted Seq’, by q- 


values; 


Set Similarity = similarity(Seq’m, Seq”m); 
Set Seqm = Se€q’m; 


© OND 


End while 


Output motif,,; 


Setm=m +1; 


End while 


Step i: Output novel motifs 


End 


Update motif, by motif generated from Seq”, (Motif Updating); 


. If Seqm is no larger than 10% of Seq, Then GOTO step i; 


percent of all CTCF binding sites were located within 3 kb 
regions ranging from 2500 bp upstream to 500 bp down- 
stream of annotated TSSs of coding genes (defined as pro- 
moter regions) while the remaining 88 % of the sites were 
evenly distributed (44 and 44 %) between intergenic and 
intragenic locations (Additional file 8: Figure $4), which is 
consistent with a previous mapping [28]. We further con- 
sidered the overlap with regulatory regions annotated by 


the ENCODE chromatin state pattern [29, 30]. Specific- 
ally, we studied the overlap of CTCF binding sites with 6 
types of genomic annotations, including active promoters, 
weak promoters, poised promoters, strong enhancers, 
poised enhancers, and insulators [29]. To further refine 
the annotation of the chromatin states, we pinpointed 
promoter, enhancer and insulator elements by adding add- 
itional restrictions of location or cofactors. For example, a 
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Fig. 2 Expression of CTCF-linked genes. The distribution of associated gene types for the different CTCF binding sites is shown for all three cell types. 
Active genes are more likely to be linked with CTCF-A, compared with CTCF-B and CTCF-C binding sites. ( and “**” indicate P-value < 0.01and 
P-value < 0.001, respectively). The expression levels are showed in logarithmic transformation 


BL-gene AL-gene CL-gene BL-gene 


HeLaS3 


DNA segment with a promoter-like chromatin state is 
regarded as a true promoter only if it also locates within 
[-2000 bp, +500 bp] of an annotated TSS. Likewise, en- 
hancers and insulators needed to be colocalized with p300 
[31] and cohesin [32, 33], respectively, the data on p300 
and cohesin binding sites were obtained from the EN- 
CODE ChIP-seq dataset [17]. In all three cell lines exam- 
ined, the three CTCF motifs exhibited pronounced 
differences in inferred functionality. Particularly, CTCF-A 
binding sites exhibited a strong tendency to overlap with 
promoters (hypergeometric test p-value < 2E-53, < 9E-57 
and < 2E-55 for GM12878, K562 and H1-hESC, respect- 
ively), while CTCF-C binding showed enrichment for over- 
lap with insulators (hypergeometric test p-value < 7E-28, < 
2E-19 and < 1E-32 for GM12878, K562 and H1-hESC, re- 
spectively). On the other hand, CTCF-B did not show en- 
richment for any of the chromatin states we tested 
(Fig. 3b), and we have for this reason, mainly compared 
CTCF-A and CTCF-C in the rest of the paper. 

We can make several predictions based on the inferred 
functionality distributions. First, considering the enrich- 
ment of CTCF-A binding sites in promoter regions 
(Fig. 3c) and the strong association with gene expression 
activity (Fig. 2), we would expect a higher incidence of 
colocalization between CTCF and transcription factor 
(TF) binding. To test this, we integrated ChIP-seq data 
for 20 different transcription factors available from EN- 
CODE (Additional file 9: Table S3) [13, 14] and calcu- 
lated the fold-enrichment of colocalization of these 
transcription factors with the CTCF binding sites com- 
pared to the input signal (see Methods). As shown in 
Fig. 4, in all three cell lines we tested, we observed that 
binding of CTCF to CTCF-A sites colocalized with 


binding of most of the TFs we tested in all three cell 
lines. Second, the engagement of CTCFs in mediating 
transcriptional insulation has been found to be coupled 
with cohesin [33]; therefore, we can expect a subset of 
CTCF binding sites to be colocalized with cohesin. In 
fact, the results show a strong association between 
CTCF-C bindings and cohesins (binding of cohesin was 
defined as the overlapping peaks of its two subunits, 
Rad21 and SMC3) (Fig. 4). We also observed a strong 
association between CTCF-C bindings and Lamina asso- 
ciated domains (LADs), which also represent a repres- 
sive chromatin environment [34]. 

In summary, CTCF-A binding regions were enriched 
for active histone modifications and tended to appear 
within promoter regions. As such, they presumably con- 
sist of active regulatory elements in enhancers and pro- 
moters. In the absence of active histone marks and 
enrichment for insulator-like genome segments, CTCF- 
Cs binding regions possibly function as enhancer- 
blocking insulators. 


The sequence and DNA methylation variation at the ig 
position of the CTCF binding motif 

We next asked what characteristics of the motifs might 
influence the divergence of their functionality. Given the 
differences in GC content among the three motifs, it is 
possible that differential DNA methylation levels at the 
CpG dinucleotide sites may result in different CTCF 
binding affinities. Two positions in the CTCF recogni- 
tion motif have been reported to show an enrichment of 
CpG dinucleotides, with a strong association between 
DNA methylation level and CTCF occupancy [9]. To in- 
vestigate the association between DNA methylation levels 
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distributions of the three CTCF binding sequence variations in the whole genome and in promoter regions ([-2500 bp, +500 bp] to TSS); the star 
(*) indicates a significant difference between the two by the hypergeometric test (p-value < 1E-50; see Additional file 8: Figure S4 for distribution 


of three CI 
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at the two sites and the potential functions of the three se- 
quence variations, we extracted the CpG methylation sta- 
tus in [-50 bp, +50 bp] regions centered at each CTCF 
binding motif [35]. 

CTCF-A sequence regions have a relatively high over- 
all CpG content and high DNA methylation levels at the 
12" position. Since the overall GC content varies among 
the CTCF-A, -B and -C sequences, it was not surprising 
to find that of all CpG dinucleotides detected in the 
three examined cell lines (13,714, 18,105 and 15,241 in 
GM12878, K562 and HeLaS3, respectively), CTCF-A se- 
quences had much higher overall CpG levels than 
CTCF- B and -C sequences (Wilcoxon test p-value < 5e- 
12, < 5e-13 and < 5e-10 for GM12878, K562 and 
HeLaS3, respectively). In agreement with Wang et al. 
[9], we also observed ultrahigh enrichment of CpG dinu- 
cleotides at the 12th position of CTCF recognition se- 
quences particularly in the CTCF-A subgroup (5-fold 


over that in CTCF-C, p-value < le-14; Fig. 5a and 
Additional file 10: Figure S5). Given the high CpG level at 
the 12th position of the CTCF-A sequences, we would 
also expect a correspondingly high DNA methylation 
level. Indeed, the DNA methylation level at the 12th pos- 
ition of CTCF-A binding sites was relatively high com- 
pared to other CpG sites (Additional file 11: Figure S6). 
One puzzling observation is that both GC content and 
DNA methylation levels appear relatively high in and 
around CTCF-A binding sites located in transcription- 
ally active regions. However, this may in part be ex- 
plained by the fact that the overall DNA methylation 
level at the constitutive CTCF binding sites is quite low 
compared to inactive or unbound CTCF sequences. 
When comparing the DNA methylation levels of CTCF- 
A binding sites with control regions, which were defined 
as the genome segments with high confident CTCF- 
A motifs but with little CTCF binding signal (ie., a 
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CTCF signal is less than bottom 5 % across all CTCF 
CHIP-seq peaks; Additional file 12: Figure S7), we 
found that the DNA methylation levels at CTCF-A 
binding sites were much lower than in the control 
regions. 

If the regulation of functional diversity among CTCF 
binding sequence variations is influenced by the relative 
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cell lines 


DNA methylation level at the 12th position, we would 
expect different associations between the DNA methyla- 
tion levels at this position and key cofactors. To test this 
hypothesis, we divided the CTCF binding sites into two 
groups labeled as DNA “methylated” and “unmethylated” 
according to the DNA methylation level at their 12th 
position. Thus, if the DNA methylation level at the 12th 
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position of a CTCF binding site was found to be greater 
than 20 % [36], this binding site would be classified as 
“methylated” otherwise the binding site is classified as 
“unmethylated”. As previously reported [9], we observed 
that the binding intensity of CTCF is negatively corre- 
lated with the methylation level at the 12th position of 
the binding sites (Additional file 13: Figure $8). We also 
examined the colocalization of cohesin with methylated 
and unmethylated CTCFs. Cohesin is one of the most 
important cofactors of CTCF, and exhibits distinct func- 
tionalities in the presence or absence of CTCF [32]. Be- 
cause the peak numbers and genome wide distributions 
of ChIP-seq reads from the two subunits of cohesin 
(Rad21 and SMC3) are substantially different, we took 
only the overlapping binding regions of Rad21 and 
SMC3 to be representative of cohesin binding sites, in- 
vestigating whether the colocalization of cohesin with 
CTCF was associated with DNA methylation level at the 
12th position, we found that cohesin was highly depleted 
from methylated CTCF binding sites (Fig. 5b), the finding 
that is consistent across the three cell lines we examined 
(hypergeometric test p-value < 0.001, < 8E-7 and < 3E-13 
for GM12878, K562 and HeLaS3, respectively). We also 


tested the effect of methylation at other sites within the 
CTCF core binding region [-10 bp, +10 bp], however, the 
12th position was the only site that showed statistically 
significant depletion of cohesin occupancy in all three cell 
lines. Moreover, estimates of the cohesin binding affinity 
(as inferred from ChIP-seq read counts of Rad21 from 
representative cohesin binding sites) in GM12878 and 
HeLaS3 cell lines were negatively correlated with the 
DNA methylation level at the 12th position of the associ- 
ated CTCF binding sequences (bootstrap test, p-value < 
5E-2 and <1E-5 for GM12878 and HeLas3, respectively; 
see Fig. 5c and Methods). The data thus indicate that the 
co-binding of cohesin with CTCF is, to some degree, 
negatively related to DNA methylation at the 12th pos- 
ition of the CTCF binding site. 

Renda and colleagues demonstrated that high-affinity 
binding to a 12 bp variation of the CFCT consensus se- 
quence involved only 4—5 specific zinc-fingers of the 
CTCF [10]. Ren and colleagues reported that 17 % of all 
evolutionary nucleotide changes in the CTCF binding 
sites took place as C-to-T substitutions as a unique nu- 
cleotide change at the 12th position [6]. Our results, 
when combined with those of others [10, 6], imply that 
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the methylation level at the 12th position of CTCF bind- 
ing sequence may alter the binding environment, result- 
ing in different zinc-finger binding and, in turn, 
recruiting different cofactors that ultimately leads to di- 
vergent functionalities. 


Functional diversity of CTCF binding sequence motifs is 
reflected in the genome 3D structure 

Because the CTCF protein is important for mediating 
chromatin-chromatin interactions [4, 37], we next asked 
what connection might exist between CTCF binding site 
variation and the DNA loops in which they are involved. 
To address this question, we revisited the chromatin in- 
teractions database in ENCODE, which is based on data 
obtained with the Chromosome Conformation Capture 
Carbon Copy (5C) method [37, 38]. These data contain 
chromatin interactions between 628 TSS-containing 
fragments and 4443 distal restriction fragments covering 
the ENCODE pilot project regions representing 1 % 
(30 Mb) of the genome for the three cell lines examined 
above (GM12878, K562 and HeLaS3). We first used the 
sequence scanning tool FIMO with default settings to 
search each distal fragment for CTCF-A, -B and -C mo- 
tifs [39]. Each fragment with the CTCF binding type was 
tagged by which had the highest confidence for this frag- 
ment (see Methods). In all three cell lines, we observed 
that fragments tagged with CTCF-A were enriched for 


TSS-distal chromatin interactions, while CTCF-C-linked 
fragments showed depletion of TSS-distal interactions 
compare to CTCF-A-linked fragments (Fig. 6a). CTCF- 
B, however, was enriched for TSS-distal interactions in 
HelaS3 and GM12878, but not in K562. In agreement 
with recent works showing that CTCF is involved in me- 
diating long-range interactions [2, 4, 40], distal frag- 
ments lacking CTCF occupation also showed a low 
tendency to form loops. We further examined the cor- 
relation of 11 histone modifications (Additional file 6: 
Table S2), 20 DNA-binding factors (Additional file 9: 
Table S3) from ENCODE and 2 chromatin domains 
(LAD: Lamina Associated Domains and TAD: Topo- 
logical Associated Domains Boundary) with loop forma- 
tion (see Methods), and observed that TSS-distal 
looping is more strongly correlated with CTCF-A bind- 
ing than with most of these other biological elements 
(Fig. 6b, see Additional file 14: Figure S9 for K562 and 
HeLaS3). 

CTCF-A binding sites appeared more involved in the 
mediation of DNA looping between TSSs and enhancer- 
containing distal fragments than did CTCF-C binding 
sites. We found that cohesin (Rad21 and Smc3) was con- 
sistently and highly positively correlated with TSS-distal 
looping in all three cell lines. Although cohesin was pre- 
viously thought to act as insulator when colocalized with 
CTCF [32, 33], our results are in agreement with recent 
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genome wide studies showing an import role for cohesin 
in establishing enhancer-promoter interactions [33, 41]. 
We also observed enrichment of the NFYA and NFYB 
factors (Fig. 4) around the CTCF-A binding regions. 
These two proteins have been shown to interact with the 
CCAAT/Enhancer Binding Protein (C/EBP) which regu- 
lates gene expression [42]; therefore we further exam- 
ined the enrichment of regulatory genome elements in 
the distal fragments. Compared to all fragments, the dis- 
tal fragments mainly occupied by CTCF-A sequences ex- 
hibited higher enrichment for enhancer-related marks 
(P-value < 0.002, 0.05 and 0.004 for P300, H3K4mel1 and 
annotated enhancers ([29, 30], respectively; Additional 
file 15: Figure $10), but not for H3K4me3. 


Conclusions 

To determine divergent DNA recognition motifs of 
CTCF in various functional genomics contexts, we 
developed a novel motif discovery workflow focusing 
on the balance between the number of sequences a 
motif represents and the internal divergence within 
the sequence subgroups. By applying the workflow to 
the ENCODE ChIP-seq data set, we identified three 
CTCF core motif variations. The expression activity 
patterns of the genes flanking the three CTCF motifs 
showed significant correlations with the GC content 
and the CpG enrichment of the motifs. We also de- 
tected a strong association among the presence of the 
CTCF-A motif and enhancers and promoters defined 
by histone marks, and we further demonstrated that 
this association was supported by chromatin struc- 
tural data from chromatin conformation capture 
based experiments. The functional divergence of the 
motifs was further associated with possible genetic or 
epigenetic variations, in particular with the CpG di- 
nucleotide coverage at the 12" position of the core 
binding motifs and the relative DNA methylation 
level at this site, and the latter two features were also 
strongly and negatively correlated with the colocaliza- 
tion of CTCF with cohesin, the key cofactor of CTCF 
when it functions as an insulator protein. 

These results suggest that the variation in DNA 
methylation level at a single CpG site of the CTCF’s 
recognition motif has a determining influence on the 
divergence of its functions. Alternative preferences for 
critical cofactors, e.g. cohesin, among the different 
motifs suggest potentially multiple molecular mecha- 
nisms for the CTCF functionalities. The workflow we 
have introduced provides a new analytical tool for the 
studies of multifunctional DNA binding proteins, par- 
ticularly for those whose functional classification is 
not yet clearly defined. 
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Methods 

Minimal motif discovery workflow 

Briefly, the workflow iteratively optimizes an object with 
the best motif score that can be found in a sub-dataset. 
See Fig. 1a, Table 1 and supplemental text for details. 


Dataset 

Most of the data used were from the ENCODE Project 
[13, 14]. CTCF occupancy was derived from ChIP-seq 
data from two independent sources [16, 17]. The ChIP- 
seq histone modification signals across the three cell 
lines (GM12878, K562 and HeLaS3) were generated by 
Broad/MGW ENCODE group [22, 29]. The chromatin 
state segmentation for each of the three cell lines was 
acquired by computationally integrating ChIP-seq data 
for nine factors plus input using a Hidden Markov 
Model [29, 30]. The genome-wide binding sites for 24 
different TFs were determined by ChIP-seq [17]. The 
DNA methylation profile was generated by Meissner et 
al. and assayed at more than 500,000 CpG dinucleotides 
in the genome, using Reduced Representation Bisulfite 
Sequencing (RRBS) as a part of the ENCODE Project 
[35]. The chromatin interaction data were generated 
using the Chromatin Conformation Capture Carbon 
Copy (5C) method from the Dekker Lab [37, 38]. 


Histone modification fold-enrichment 

To obtain the most reliable CTCF binding sites, we de- 
termined the binding sites by a combination of CTCF 
ChIP-seq data and motif scanning. Briefly, the peaks of 
CTCF ChIP-seq data were first called by MACS [19] 
with threshold FDR < 0.01. Next, the gained peak regions 
were scanned by the three CTCF motif variations gener- 
ated by our workflow (Fig. 1a, Table 1) using motif scan 
software FIMO [39]. Then, a CTCF binding site in the 
peak region was defined as the motif instance locus hav- 
ing a FIMO E-value<0.01. Regions [-1500 bp, 
+1500 bp] centered at each CTCF binding motif were 
extracted and partitioned into 150 bins of 20 bp each. 
The signal strength of each bin was retrieved from the ori- 
ginal ENCODE bigwig files (Additional file 6: Table S2). 
The fold-enrichment of a histone modification at each bin 
was defined as: 


S/C; 


Where Sj is the strength of the i-th histone modifica- 
tion within the j-th bin, and C; is the strength of Input 
in the j-th bin. 


CTCF colocalization fold-enrichment 
The fold-enrichment of 36 biological elements (Add- 
itional file 6: Table S2 and Additional file 9: Table S3) 
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for colocalization with CTCF binding sites is defined as 
follows: 


P;/C; 


where P; is the percentage (%) of CTCF binding sites 
overlapped by biological element i. and C; is the percent- 
age of control regions overlapped by the same biological 
element i as P;. Control regions were the peaks of the 
ChIP-seq Input experiment also called by MACS with 
FDR <0.01; the CTCF overlapped regions were dis- 
carded. Features were determined to be colocalized with 
CTCF binding sites if they were overlapped by at least 
one nucleotide. 


Pearson correlations between genomic elements and 
looping 

Detected looping events are very sparse in the 5C data; 
only 1.2 % of all distal-TSS pairs contain a significant 
loop (positive set [37]). Therefore, to correlate looping 
events with genomic elements, it is necessary to take the 
sparseness, i.e., the huge number of distal-TSS pairs with 
no significant 5C loop (negative set) into consideration. 
We used a bagging strategy to down-sample the negative 
observations to estimate the distribution of Pearson’s 
correlation coefficient (PCC) between genomic elements 
and 5C looping. In detail, we randomly sampled the 
same number of distal-TSS pairs with no 5C loops to 
form a control dataset, and 1000 such control datasets 
were generated, and the PCC distribution for each 
genomic element was calculated for the 1000 combined 
subsets. 


Availability of supporting data 
All our data have been made available as the online sup- 
porting materials. 


Supporting information 
Detailed information on the minimal motif discovery 
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